[TLE][MTHREADS] Support TLE Structure on mthreads backend by Kylin1207 · Pull Request #617 · flagos-ai/FlagTree

Kylin1207 · 2026-05-26T09:20:19Z

This patch adds MTHREADS backend support for the main TLE Structure primitives:

tle.gpu.memory_space(x, "shared_memory")
- Marks a ranked tensor for shared-memory materialization.
- Load inputs can lower through async global-to-shared copy.
- Non-load tensor inputs materialize through initialized ttg.local_alloc + ttg.local_load.
- Only "shared_memory" is supported on mthreads; "tensor_memory" and other spaces are rejected.
tle.gpu.alloc
- Supports shared-memory buffers backed by ttg.local_alloc.
- Supports explicit swizzled shared layouts and initialized allocations.
- nv_mma_shared_layout=True/default is not supported on mthreads.
- tmem allocation is not supported.
tle.gpu.local_ptr
- Supports full-view and indexed shared-memory pointers, scalar and tensor indices, 1D/2D use cases, local load/store, masked tails, loops, dot operands, and runtime round trips.
- Adds automatic barrier insertion for local pointer load-after-store hazards.
- Adds optimizations that rewrite eligible full-view local pointer loads/stores to memdesc ops.
- Extends Triton atomic operand type handling for address-space-3 shared-memory pointers.
- Limitations: indices must be integer typed; scalar/tensor indices cannot be mixed; index rank must match buffer rank; only shared-memory buffers are supported.
tle.gpu.copy
- Supports normal global-memory <-> shared-memory copies using tle.buffered_tensor.
- Supports descriptor/TME-style copies in both directions: descriptor -> smem and smem -> descriptor.
- Validates shape, dtype, buffer storage, descriptor offsets, and offset rank.
- Descriptor copy requires offsets.
- Normal copy currently requires the pointer tensor shape to exactly match the buffer copy shape.
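
The limitations listed above can be read as a small set of argument checks. The following is a minimal pure-Python sketch of those rules, not the code from this patch: `TensorIndex`, `check_memory_space`, `check_local_ptr_indices`, and `check_copy` are hypothetical names introduced only for illustration, and the real frontend may phrase or order its checks differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass(frozen=True)
class TensorIndex:
    """Hypothetical stand-in for a tensor-valued index; only its dtype is modelled."""
    dtype: str  # e.g. "int32" or "float32"


Index = Union[int, TensorIndex]


def check_memory_space(space: str) -> None:
    # Per the description, only "shared_memory" is accepted on mthreads.
    if space != "shared_memory":
        raise ValueError(f"unsupported memory space on mthreads: {space!r}")


def check_local_ptr_indices(buffer_shape: Sequence[int],
                            indices: Optional[Sequence[Index]] = None) -> None:
    if indices is None:
        return  # full-view pointer: nothing to validate
    if len(indices) != len(buffer_shape):
        raise ValueError("index rank must match buffer rank")
    kinds = set()
    for idx in indices:
        if isinstance(idx, TensorIndex):
            if not idx.dtype.startswith(("int", "uint")):
                raise TypeError("indices must be integer typed")
            kinds.add("tensor")
        elif isinstance(idx, int):
            kinds.add("scalar")
        else:
            raise TypeError("indices must be integer typed")
    if len(kinds) > 1:
        raise ValueError("scalar and tensor indices cannot be mixed")


def check_copy(ptr_shape: Optional[Sequence[int]], copy_shape: Sequence[int], *,
               descriptor: bool = False, offsets: Optional[Sequence[int]] = None) -> None:
    if descriptor:
        # Descriptor copies need explicit offsets.
        if offsets is None:
            raise ValueError("descriptor copy requires offsets")
        return
    # Normal copies need the pointer tensor shape to equal the copy shape.
    if ptr_shape is None or tuple(ptr_shape) != tuple(copy_shape):
        raise ValueError("pointer tensor shape must match the buffer copy shape")
```

For example, `check_local_ptr_indices((4, 8), [0, TensorIndex("int32")])` is rejected in this sketch because it mixes a scalar and a tensor index, while `check_local_ptr_indices((4, 8), [0, 1])` passes.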

Lowering path

Key differences from native Triton:

- tle.gpu.memory_space(..., "shared_memory") is consumed early; no tt.memory_space marker remains after lowering.
- tle.gpu.local_ptr introduces musa_tle.local_pointers, which is later optimized or lowered away before LLVM IR.
- Descriptor-based tle.gpu.copy uses ttg.tma_copy as an intermediate but lowers to mthreads/MUSA TME ops such as ttmg.async_tme_copy_global_to_local, ttmg.async_tme_copy_local_to_global, and LLVM MUSA TME intrinsics, instead of the native Triton TMA lowering.
- Normal tle.gpu.copy lowers through load/store plus local pointer paths, with an mthreads-specific async-store optimization.

Performance Data

Benchmark sources:

- python/tutorials/tle/01-fft.py
- python/tutorials/tle/03-topk.py

Note:
For MTHREADS testing, the tutorials currently require manually replacing is_cuda with is_musa before running.
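
One way to script that manual edit is sketched below. It assumes the note means an identifier-level rename and that an `is_musa` helper is available wherever the tutorial looks up `is_cuda`; the description does not confirm either point. `retarget_to_musa` is a hypothetical helper name.

```python
import re


def retarget_to_musa(source: str) -> str:
    """Rename whole-word occurrences of `is_cuda` to `is_musa` in tutorial source text."""
    # \b keeps longer identifiers such as `this_is_cuda_like` untouched.
    return re.sub(r"\bis_cuda\b", "is_musa", source)


if __name__ == "__main__":
    sample = "if is_cuda():\n    run_benchmark()\n"
    print(retarget_to_musa(sample))
```

Applying the function to a copy of the tutorial file, rather than editing it in place, keeps the original CUDA path available for comparison.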

Environment:

Driver: 4.3.5
SDK: 5.1.0
Torch: 2.9.0

Baselines and results on large-shape cases:

- TLE Radix TopK vs Triton TopK: 3.3x speedup.
- TLE Radix TopK vs Torch TopK: 1.2x speedup.
- TLE FFT vs Triton FFT: 1.9x speedup.
- TLE FFT vs Torch FFT: 20x speedup.
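
As a reading aid, the two speedups reported for each kernel also imply a ratio between the two baselines. The sketch below works that out under the assumption that both speedups for a kernel were measured on the same inputs and hardware, which the description does not state explicitly; `implied_baseline_ratio` is a hypothetical helper name.

```python
def implied_baseline_ratio(speedup_vs_a: float, speedup_vs_b: float) -> float:
    """Return t_A / t_B given speedup_vs_a = t_A / t_TLE and speedup_vs_b = t_B / t_TLE."""
    return speedup_vs_a / speedup_vs_b


# TopK: Triton TopK time relative to Torch TopK time.
topk_triton_over_torch = implied_baseline_ratio(3.3, 1.2)

# FFT: Torch FFT time relative to Triton FFT time.
fft_torch_over_triton = implied_baseline_ratio(20.0, 1.9)

print(f"TopK: Triton/Torch time ratio ~ {topk_triton_over_torch:.2f}")
print(f"FFT:  Torch/Triton time ratio ~ {fft_torch_over_triton:.1f}")
```

Under that assumption, the Triton TopK baseline is roughly 2.75x slower than Torch TopK, and the Torch FFT baseline is roughly 10.5x slower than Triton FFT on these cases.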

sunnycase

Thanks for the contribution and for adding TLE structure support for the mthreads backend.

Could you please update the PR description or add supporting documentation to explain which TLE primitives are implemented/supported by this work, and include performance benefit data so reviewers can evaluate whether the implementation scope matches the expected value?

It would be helpful to include:

- The list of implemented TLE primitives, their semantic coverage, and any partial support or known limitations.
- The lowering/runtime path for each key primitive, especially where it differs from the native Triton path.
- Performance data: benchmark cases, input sizes, hardware/driver environment, baseline, before/after results, improvement ratio, and any regression cases.
- If this PR is currently only structural enablement and has no measurable performance gain yet, please state that explicitly and describe the follow-up validation plan.

zhzhcookie

LGTM

sunnycase

LGTM

Kylin1207 added 10 commits May 23, 2026 16:45

[CHORE] Ignore FLAGTREE_BACKEND and mthreads build artifacts

1acbb5a

[TLE][MTHREADS] Support TLE memory_space on mthreads backend

503a913

[TEST][MTHREADS] Deduplicate mthreads TLE test helpers

7654831

[TLE][MTHREADS] Support TLE alloc on mthreads backend

b1c1478

[TLE][MTHREADS] Support TLE local_ptr on mthreads backend

a454f6a

[TLE][MTHREADS] Clarify mthreads TLE frontend and dialect ownership

bfb3f6e

[TLE][MTHREADS] Support TLE copy on mthreads backend

38a529e

[TLE][MTHREADS] Support atomic operands

4b4f2c2

[TLE][MTHREADS] Avoid illegal fp16 local pointer async copies

64ccc8a

[TLE][MTHREADS] Remove comments

d1080e2

Kylin1207 requested review from sunnycase and zhzhcookie as code owners May 26, 2026 09:20

github-actions bot added the mthreads and triton_v3.6.x labels May 26, 2026

sunnycase reviewed May 29, 2026

[TEST][MTHREADS] clear max_shared_mem cache after runtime limit test

7625469

zhzhcookie approved these changes Jun 2, 2026

sunnycase approved these changes Jun 3, 2026

sunnycase merged commit fea4914 into flagos-ai:triton_v3.6.x Jun 3, 2026
11 of 13 checks passed

sunnycase merged 11 commits into flagos-ai:triton_v3.6.x from Kylin1207:pr/mthreads/dev_tle_structure

3 participants
